Memory Efficient De Bruijn Graph Construction

Authors

  • Yang Li
  • Pegah Kamousi
  • Fangqiu Han
  • Shengqi Yang
  • Xifeng Yan
  • Subhash Suri
Abstract

Massively parallel DNA sequencing technologies are revolutionizing genomics research. Billions of short reads, generated at low cost, can be assembled to reconstruct whole genomes. Unfortunately, the large memory footprint of existing de novo assembly algorithms makes assembly challenging for higher eukaryotes such as mammals. In this work, we investigate the memory cost of constructing the de Bruijn graph, a core task in leading assembly algorithms, which often consumes several hundred gigabytes of memory for large genomes. We propose a disk-based partitioning method, called Minimum Substring Partitioning (MSP), that completes the task using less than 10 gigabytes of memory without runtime slowdown. MSP breaks the short reads into multiple small disjoint partitions so that each partition can be loaded into memory, processed individually, and later merged with the others to form a de Bruijn graph. By leveraging the overlaps among k-mers (substrings of length k), MSP achieves a high compression ratio: the total size of the partitions is reduced from Θ(kn) to Θ(n), where n is the size of the short-read database and k is the length of a k-mer. Experimental results show that our method can build de Bruijn graphs for any large-volume sequence dataset on a commodity computer. Source code and datasets: grafia.cs.ucsb.edu/msp
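The partitioning idea can be illustrated with a minimal Python sketch. The abstract does not spell out the procedure, so the details here are assumptions: each k-mer is keyed by its lexicographically smallest length-p substring, adjacent k-mers in a read that share that key are merged into one longer string, and the merged string is routed to a partition by a hash of the key. The parameter `p`, the function names, and the use of CRC32 as the hash are all hypothetical choices made for illustration, and reverse complements are ignored.

```python
import zlib
from collections import defaultdict


def min_substring(kmer, p):
    """Return the lexicographically smallest length-p substring of kmer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def super_kmers(read, k, p):
    """Yield (key, merged_string) for each maximal run of adjacent k-mers
    in `read` that share the same minimum p-substring (the key)."""
    if len(read) < k:
        return
    start = 0
    current = min_substring(read[0:k], p)
    for i in range(1, len(read) - k + 1):
        key = min_substring(read[i:i + k], p)
        if key != current:
            # k-mers start..i-1 together cover read[start : i - 1 + k]
            yield current, read[start:i - 1 + k]
            start, current = i, key
    # the final run extends to the end of the read
    yield current, read[start:]


def partition(reads, k, p, num_partitions):
    """Assign every merged run to a partition chosen by hashing its key.
    CRC32 is an assumed, deterministic stand-in for the hash function."""
    parts = defaultdict(list)
    for read in reads:
        for key, merged in super_kmers(read, k, p):
            parts[zlib.crc32(key.encode()) % num_partitions].append(merged)
    return parts


if __name__ == "__main__":
    for key, merged in super_kmers("ACGTACGTAC", k=5, p=2):
        print(key, merged)
```

In this toy read, all six 5-mers share the minimum 2-substring `AC`, so they are stored as one 10-character string rather than six separate 5-character strings. That is the kind of saving behind the Θ(kn) to Θ(n) reduction described above, although the sketch makes no claim about how the authors' implementation achieves it.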

Similar resources

HaVec: An Efficient de Bruijn Graph Construction Algorithm for Genome Assembly

BACKGROUND The rapid advancement of sequencing technologies has made it possible to regularly produce millions of high-quality reads from the DNA samples in the sequencing laboratories. To this end, the de Bruijn graph is a popular data structure in the genome assembly literature for efficient representation and processing of data. Due to the number of nodes in a de Bruijn graph, the main barri...

On Binary de Bruijn Sequences from LFSRs with Arbitrary Characteristic Polynomials

We propose a construction of de Bruijn sequences by the cycle joining method from linear feedback shift registers (LFSRs) with arbitrary characteristic polynomial f(x). We study in detail the cycle structure of the set Ω(f(x)) that contains all sequences produced by a specific LFSR on distinct inputs and provide an efficient way to find a state of each cycle. Our structural results lead to an e...

TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes

Motivation de Bruijn graphs have been proposed as a data structure to facilitate the analysis of related whole genome sequences, in both a population and comparative genomic settings. However, current approaches do not scale well to many genomes of large size (such as mammalian genomes). Results In this article, we present TwoPaCo, a simple and scalable low memory algorithm for the direct con...

De Bruijn Graph Homomorphisms and Recursive De Bruijn Sequences

This paper presents a method to find new de Bruijn cycles based on ones of lesser order. This is done by mapping a de Bruijn cycle to several vertex disjoint cycles in a de Bruijn digraph of higher order and connecting these cycles into one full cycle. We characterize homomorphisms between de Bruijn digraphs of different orders that allow this construction. These maps generalize the well-known ...

A method for constructing decodable de Bruijn sequences

In this paper we present two related methods of construction for de Bruijn sequences, both based on interleaving “smaller” de Bruijn sequences. Sequences obtained using these construction methods have the advantage that they can be “decoded” very efficiently, i.e., the position within the sequence of any particular “window” can be found very simply. Sequences with simple decoding algorithms are...

Journal title:
  • CoRR

Volume: abs/1207.3532

Publication date: 2012